AITopics | mitigating ai deception

Honesty Is the Best Policy: Defining and Mitigating AI Deception

Neural Information Processing SystemsDec-23-2025, 19:22:03 GMT

Deceptive agents are a challenge for the safety, trustworthiness, and cooperation of AI systems. We focus on the problem that agents might deceive in order to achieve their goals (for instance, in our experiments with language models, the goal of being evaluated as truthful).There are a number of existing definitions of deception in the literature on game theory and symbolic AI, but there is no overarching theory of deception for learning agents in games. We introduce a formaldefinition of deception in structural causal games, grounded in the philosophyliterature, and applicable to real-world machine learning systems.Several examples and results illustrate that our formal definition aligns with the philosophical and commonsense meaning of deception.Our main technical result is to provide graphical criteria for deception. We show, experimentally, that these results can be used to mitigate deception in reinforcement learning agents and language models.

best policy, deception, name change, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.61)

Add feedback

Honesty Is the Best Policy: Defining and Mitigating AI Deception

Neural Information Processing SystemsOct-9-2024, 11:31:12 GMT

Deceptive agents are a challenge for the safety, trustworthiness, and cooperation of AI systems. We focus on the problem that agents might deceive in order to achieve their goals (for instance, in our experiments with language models, the goal of being evaluated as truthful).There are a number of existing definitions of deception in the literature on game theory and symbolic AI, but there is no overarching theory of deception for learning agents in games. We introduce a formaldefinition of deception in structural causal games, grounded in the philosophyliterature, and applicable to real-world machine learning systems.Several examples and results illustrate that our formal definition aligns with the philosophical and commonsense meaning of deception.Our main technical result is to provide graphical criteria for deception. We show, experimentally, that these results can be used to mitigate deception in reinforcement learning agents and language models.

best policy, deception, mitigating ai deception, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.65)

Add feedback

Experiments with Detecting and Mitigating AI Deception

Sahbane, Ismail, Ward, Francis Rhys, Åslund, C Henrik

arXiv.org Artificial IntelligenceJun-26-2023

How to detect and mitigate deceptive AI systems is an open problem for the field of safe and trustworthy AI. We analyse two algorithms for mitigating deception: The first is based on the path-specific objectives framework where paths in the game that incentivise deception are removed. The second is based on shielding, i.e., monitoring for unsafe policies and replacing them with a safe reference policy. We construct two simple games and evaluate our algorithms empirically. We find that both methods ensure that our agent is not deceptive, however, shielding tends to achieve higher reward.

agent, artificial intelligence, machine learning, (14 more...)

arXiv.org Artificial Intelligence

2306.14816

Country: